Problem Set 4 - Solutions

library(ggraph)
library(tidygraph)
library(tidymodels)
library(tidytext)
library(tidyverse)
library(topicmodels)
my_theme <- theme_bw() +
  theme(
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.grid.minor = element_blank(),
    axis.ticks = element_blank(),
    plot.background = element_rect(fill = "transparent", colour = NA)
  )
theme_set(my_theme)

Political Book Recommendations

Scoring

  • a and b, Design (0.5 points each): Creative and readable design (0.5 points), generally appropriate but lacking critical attention (0.25 points), difficult to read (0 points).
  • a and b, Code (0.5 points each): Clear and concise (0.5 points), correct but unnecessarily complex (0.25 points), missing (0 points)
  • c, Discussion (1 point): Correct and well-developed interpretation (1 point), correct but somewhat underdeveloped (0.5 points), missing or incorrect interpretations (0 points).

Question

In this problem, we’ll study a network dataset of Amazon bestselling US Politics books. Books are linked by an edge if they appeared together in the recommendations (“customers who bought this book also bought these other books”).

Example Solution

  a. The code below reads in the edges and nodes associated with the network. The edges dataset only contains IDs of co-recommended books, while the nodes data includes attributes associated with each book. Build a tbl_graph object to store the graph.
library(tidygraph)
edges <- read_csv("https://uwmadison.box.com/shared/static/54i59bfc5jhymnn3hsw8fyolujesalut.csv", col_types = "cci")
nodes <- read_csv("https://uwmadison.box.com/shared/static/u2x392i79jycubo5rhzryxjsvd1jjrdy.csv", col_types = "ccc")
G <- tbl_graph(nodes, edges)
  b. Use the result from part (a) to visualize the network as a node-link diagram. Include the book’s title in the node label, and shade in the node according to political ideology.

We can use geom_node_label and geom_edge_link to draw the node-link diagram. Customizing the edge and node properties gives a more readable view.

ggraph(G, "kk") +
  geom_edge_link(width = 1.5, color = "#d3d3d3", alpha = 0.5) +
  geom_node_label(aes(fill = political_ideology, label = Label)) +
  scale_fill_manual(values = c("#F26849", "#5377A6", "#BFBFBF")) +
  theme_void()

  c. Create the analogous adjacency matrix visualization. Provide examples of visual queries that are easy to answer using one encoding but not the other (i.e., what is easy to see in the node-link view vs. what is easy to see in the adjacency matrix).

We can use geom_edge_tile and geom_node_point to draw the adjacency matrix and associate each row / column with a political ideology. Topological queries are easier to answer in the node-link view (e.g., which books lie within two steps of “The O’Reilly Factor”). Questions about the degree distribution are easier to answer in the adjacency matrix view, because we can count the black squares along individual rows / columns.

ggraph(G, "matrix") +
  geom_edge_tile(mirror = TRUE) +
  geom_node_point(aes(col = political_ideology), x = -1) +
  geom_node_point(aes(col = political_ideology), y = nrow(data.frame(G)) + 1) +
  scale_color_manual(values = c("#F26849", "#5377A6", "#BFBFBF")) +
  coord_fixed() +
  theme_void() +
  theme(legend.position = "bottom")

Topics in Pride and Prejudice

Scoring

  • a and b, Design (0.5 points each): Creative and readable design (0.5 points), generally appropriate but lacking critical attention (0.25 points), difficult to read (0 points).
  • a and b, Code (0.5 points each): Clear and concise (0.5 points), correct but unnecessarily complex (0.25 points), missing (0 points)
  • c, Discussion (1 point): Correct and well-developed interpretation (1 point), correct but somewhat underdeveloped (0.5 points), missing or incorrect interpretations (0 points).

Question

This problem uses LDA to analyze the full text of Pride and Prejudice. The object paragraph is a data.frame whose rows are paragraphs from the book. We’ve filtered out very short paragraphs, e.g., those from dialogue. For example, we’re interested in how the topics appearing in the book vary from its start to its end.

Example Solution

  a. Create a Document-Term Matrix containing word counts from across the same paragraphs. That is, the \(i^{th}\) row of dtm should correspond to the \(i^{th}\) row of paragraph. Make sure to remove all stopwords.
paragraphs <- read_csv("https://uwmadison.box.com/shared/static/pz1lz301ufhbedzsj9iioee77r95xz4v.csv") 
dtm <- paragraphs %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(paragraph, word) %>%
  cast_dtm(paragraph, word, n)
  b. Fit an LDA model to dtm using 6 topics. Set the seed by using the argument control = list(seed = 479) to remove any randomness in the result.

We can use the LDA function to estimate the topic model.

fit <- LDA(dtm, k = 6, control = list(seed = 479))
  c. Visualize the top 30 words within each of the fitted topics. Specifically, create a faceted bar chart where the lengths of the bars correspond to word probabilities and the facets correspond to topics. Reorder the bars so that each topic’s top words are displayed in order of decreasing probability.

To ensure that the bars appear in decreasing order for each facet, we use the reorder_within and scale_x_reordered() functions from the tidytext package. A few of the words are common across all the topics (e.g., Elizabeth and Darcy), so an approach using discriminative words might be more informative. Nonetheless, we can still distinguish a few key episodes from the text appearing as different topics, like Lydia’s elopement with Wickham in topic 5.

beta <- tidy(fit, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(30, beta) %>%
  mutate(term = reorder_within(term, -beta, topic))

ggplot(beta) +
  geom_col(aes(x = term, y = beta)) +
  labs(x = "Term", y = expression(beta)) +
  facet_wrap(~ topic, ncol = 3, scales = "free_x") +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0, 0, 0.1, 0)) +
  theme(axis.text.x = element_text(angle = 90, size = 8, vjust = .5))
Example topic proportions for problem 5(b).

  d. Find the paragraph that is the purest representative of Topic 2. That is, if \(\gamma_{ik}\) denotes the weight of topic \(k\) in paragraph \(i\), then print out paragraph \(i^{\ast}\) where \(i^{\ast} = \arg \max_{i}\gamma_{i2}\). Verify that at least a few of the words with high probability for this topic appear. Only copy the first sentence into your solution.

To find the document with the largest gamma in topic 2, we can use the top_n command. Then, we filter to this paragraph and print out the text.

gamma <- tidy(fit, matrix = "gamma")
top_ix <- gamma %>%
  filter(topic == 2) %>%
  top_n(1, gamma) %>%
  pull(document)

sentence <- paragraphs %>%
  filter(paragraph == top_ix) %>%
  pull(text)
substr(sentence, 1, 120)
## [1] "sir william and lady lucas were speedily applied to for their consent; and it was bestowed with a most joyful alacrity. "

Food Nutrients

Scoring

  • a and b, Design (0.5 points each): Creative and readable design (0.5 points), generally appropriate but lacking critical attention (0.25 points), difficult to read (0 points).
  • a and b, Code (0.5 points each): Clear and concise (0.5 points), correct but unnecessarily complex (0.25 points), missing (0 points)
  • c, Discussion (1 point): Correct and well-developed interpretation (1 point), correct but somewhat underdeveloped (0.5 points), missing or incorrect interpretations (0 points).

Question

This problem will use PCA to provide a low-dimensional view of a 14-dimensional nutritional facts dataset. The data were originally curated by the USDA and are regularly used in visualization studies.

nutrients <- read_csv("https://uwmadison.box.com/shared/static/nmgouzobq5367aex45pnbzgkhm7sur63.csv")

Example Solution

  a. Define a tidymodels recipe that normalizes all nutrient features and specifies that PCA should be performed.

The code below defines a recipe that first normalizes the predictors and then performs PCA.

pca_recipe <- recipe(~., data = nutrients) %>%
  update_role(name, id, starts_with("group"), new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())
pca_prep <- prep(pca_recipe)
  b. Visualize the top 6 principal components. What types of food do you expect to have low or high values for PC1 or PC2?

The first 6 components are plotted below. Each component can be thought of as a derived feature, or contrast, that explains a significant amount of the variation across the foods. Since the first PC has a very positive weight on water and very negative weights on calories and fat, we suspect that it distinguishes fruits / vegetables from everything else. Since the second PC has very positive weights on carbohydrates and very negative weights on water and fats, we suspect that foods scoring high on it are starchy, like bread or pasta.

pca_result <- tidy(pca_prep, 2) %>%
  mutate(terms = str_replace(terms, " \\(g\\)", ""))
ggplot(pca_result %>% filter(component %in% str_c("PC", 1:6))) +
  geom_col(aes(x = value, y = terms)) +
  facet_wrap(~ component) +
  labs(x = "Component Value", y = "Feature")
An example result for problem 3(b).

  c. Compute the average value of PC2 within each category of the group column. Give the names of the groups sorted by this average.

The bake function below extracts the PC scores for each sample. We then compute the mean of PC2 within each food group using the group_by and summarise pattern.

pca_scores <- bake(pca_prep, nutrients)
group_order <- pca_scores %>%
  group_by(group) %>%
  summarise(mpc2 = mean(PC2)) %>%
  arrange(mpc2) %>%
  pull(group)
  d. Visualize the scores of each food item with respect to the first two principal components. Facet the visualization according to the group column, and sort the facets according to the results of part (c). How does the result compare with your guess from part (b)?

We mutate the group column so that the facets will be reordered according to the results of (c). In general, the facets of a ggplot2 plot will always be ordered by the factor levels of the defining variable. From here, the result is a standard scatterplot, though we make the axes slightly bolder and customize the color and axes labels.

pca_scores %>%
  mutate(group = factor(group, levels = group_order)) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_vline(xintercept = 0, col = "#4a4a4a") +
  geom_hline(yintercept = 0, col = "#4a4a4a") +
  geom_point(size = 0.4, alpha = 0.6) +
  scale_x_continuous(breaks = seq(-8, 0, length.out = 3)) +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(~ group, ncol = 9) +
  coord_fixed() +
  theme(strip.text = element_text(size = 8))
An example result for problem 5(d).

Interactive Phylogeny

Scoring

  • a, Design (0.5 points): Creative and readable design (0.5 points), generally appropriate but lacking critical attention (0.25 points), difficult to read (0 points).
  • a, Code (0.5 points): Clear and concise (0.5 points), correct but unnecessarily complex (0.25 points), missing (0 points)
  • b, Design (1 point): Creative and readable design (1 point), generally appropriate but lacking critical attention (0.5 points), difficult to read (0 points).
  • b, Code (1 point): Clear and concise (1 point), correct but unnecessarily complex (0.5 points), missing (0 points)
  • c, Discussion (1 point): Correct and well-developed proposal (1 point), correct but somewhat underdeveloped (0.5 points), missing or incorrect statements (0 points).

Question

We will build an interactive phylogenetic tree of SARS-CoV-2 genetic sequences. Each sequence has been annotated with a date and location of its discovery. We will use D3 to allow readers to explore the way genetic changes unfold over time and space. You can find the raw data here: nodes, edges. We have provided starter code to build a d3.stratify() object from the edge data and to define an object, node_lookup, which can be used to look up the country and date associated with the from and to fields in the edges.

Example Solution

  a. Create a static tree visualization that shows how the different COVID variants evolved from one another. Use color to encode the location of the variant’s discovery. You may group rare countries into “Other,” and draw variants with unknown origins using either white or grey.

Our full implementation can be read here.

The main steps to generate this tree are,

  • Build a data structure storing the coordinates of the nodes and endpoint ids along the edges.
  • Bind circles and paths that visually encode this data structure.

The function make_tree supports the first step, converting the raw edgelist into a stratified, tree-structured object. d3.stratify() is used to parse the edge list into a tree structure, and d3.tree() lays out the nodes onto a canvas of size \([400, 400]\). Note that these functions produce generators: their outputs can be applied to any set of edges, but they don’t directly contain the structured, laid-out data themselves.

function make_tree(edges) {
  edges.push({to: 1, from: null}); // add a root so the edge list forms a tree
  let stratifier = d3.stratify()
    .id(d => d.to)
    .parentId(d => d.from)
  let tree_gen = d3.tree()
    .size([400, 400])

  return tree_gen(stratifier(edges))
}

The draw_tree() function below accomplishes step 2, taking the structured tree output and appending paths and circles to the canvas. The first block, with .selectAll("path"), draws the links in the tree, while the second, with selectAll("circle"), draws the nodes. Note that we are using d3.linkVertical() to define the smooth path coordinates. Moreover, we had to use a nodes_lookup object to map the node IDs to their original attributes. This is because the tree was constructed using only the edge information passed into make_tree() – it does not contain the node information itself. We therefore created a separate object that maps each node’s ID to a small object containing the encodable attributes.

function draw_tree(tree, nodes_lookup) {
  let link = d3.linkVertical()
    .x(d => d.x)
    .y(d => d.y)

  d3.select("#tree")
    .selectAll("path")
    .data(tree.links()).enter()
    .append("path")
    .attrs({
      d: link,
      "stroke-width": 2,
      stroke: "#d3d3d3"
    })

  d3.select("#tree")
    .selectAll("circle")
    .data(tree.descendants()).enter()
    .append("circle")
    .attrs({
      cx: d => d.x,
      cy: d => d.y,
      fill: d => scales.fill(nodes_lookup[d.id].country),
      r: 10
    })
}

The nodes_lookup object was defined as follows. The i + 1 is needed because the node IDs are indexed starting at 1 rather than 0.

nodes_lookup = {}
for (let i = 0; i < nodes.length; i++) {
  nodes_lookup[i + 1] = nodes[i]
}
  b. Implement one of the following forms of interactivity,
    i. Provide a selection menu that highlights countries that the user selects.
    ii. As the user hovers near a node, highlight all of its ancestors. Blend the rest of the nodes into the background.
    iii. As the user brushes a collection of nodes, highlight only those nodes under the brush. Blend the rest of the nodes into the background.

The solution below implements both (i) and (ii). The country multiselect input follows the Gapminder example from the notes, and the hover interactivity is based on the Week 9 In-Class Demo. Specifically, the hover interactivity is implemented by the update_labels() function and the response to the dropdown menu is provided by highlight_countries(). As before, the full implementation can be read here.
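The core of the dropdown response can be sketched as pure logic, separate from any rendering: map each node to an opacity depending on whether its country is currently selected. The function and field names below are stand-ins for illustration, not the ones used in the linked implementation.

```javascript
// Sketch of the idea behind highlight_countries(): given the array of
// currently selected countries, compute an opacity for every node so that
// unselected countries blend into the background. Names are hypothetical.
function highlight_opacities(nodes, selected) {
  return nodes.map(d => selected.includes(d.country) ? 1.0 : 0.1);
}

// Example: highlight only the strains discovered in "China". In the real
// visualization, these values would be bound to the circles' opacity attribute.
const strain_nodes = [
  { id: "1", country: "China" },
  { id: "2", country: "USA" },
  { id: "3", country: "China" }
];
const opacities = highlight_opacities(strain_nodes, ["China"]);
```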

  c. Propose, but do not implement, an extended version of part (b) that is linked with an additional table or visualization. How would the second graphic be updated in response to user interactions? What additional queries become possible in your proposed visualization?

There are many potential solutions to this problem. Here are some possibilities,

  • We could link the tree with a timeline with time on the \(x\) axis, countries on the \(y\) axis, and dots indicating when a particular strain was discovered. By hovering over the tree, we could highlight the matching nodes in the timeline, mirroring the highlighting of descendants / ancestors in the tree. Brushing on the timeline could highlight the corresponding nodes on the tree. This linking would let us see how quickly some strains evolved from others over the course of the tree. It would also support queries about individual time windows, like the number of strains that appeared or their countries of origin.
  • We could link the tree with a bar chart giving the counts of strains discovered in different countries. Clicking on the bar chart could substitute for selecting countries from a dropdown menu. Hovering over a node could update the bar chart to show the country composition for ancestor and descendant nodes in the current selection. This would simplify comparisons of country totals along different subtrees. Comparing bar heights would be much easier than guessing the number of circles of different colors, especially when the counts are close.
  • We could link a table to the original tree so that brushing over a set of nodes populates the table with just those nodes in the current selection. The table could include as much detail as we want, and we would then be able to query that table by referring to the tree. For example, we could include the full ATGC sequence of each strain, and hovering over the tree could highlight the locations of the specific mutations that led to new strains.

Hierarchical Edge Bundling

Scoring

  • a (1 point): Correct and complete explanation of root object (1 point), partial explanation of root object (0.5 points), incorrect or missing explanation (0 points).

  • b (1.5 points): Correct interpretation of the .attr line and explanation of why paths have length longer than 2 (1.5 points), partially correct interpretations and explanations (0.75 points), missing or entirely incorrect explanation (0 points).

  • c (0.5 points): Correct identification of importance of hierarchical structure (0.5 points), incorrect or missing explanation for importance of hierarchical structure (0 points).

  • a - d, Discussion for role (3/8 points each): Complete and accurate description of function (3/8 points), generally correct but either vague or copied from notes (1/8 points), incorrect or incomplete (0 points)

  • a - d, Discussion for example (3/8 points each): Creative and clearly explained scenario (3/8 points), correct but vague or copied from notes (1/8 points), inaccurate or underdeveloped example (0 points).

Question

In this problem, we will study a D3 hierarchical edge bundling implementation available at this link. The display shows how different files in a software package import from one another. Unlike a naive radial node-link layout, this layout “bundles” together edges if their source and target nodes have common ancestors in the package’s directory tree (which is why the resulting layout is called a “Hierarchical Edge Bundling”).

Example Solution

  a. Use console.log() to inspect the root object. Describe its structure.

A screenshot of the root object is given below. The object is a tree-structured JS object with nesting structure specified by the children attribute. The data attribute at each level provides the data associated with each node in the tree. Each node additionally includes an x and y property which provides the coordinate location for the node on the SVG canvas.

Based on this output, the key observation is that though the hierarchical edge bundling does not display a tree, it is implicitly based on one. Specifically, each label corresponds to one leaf in the tree. The center of the diagram gives the tree’s root position, and intermediate levels of the hierarchy are present (but invisible) in the middle of the circle. The curves are formed by asking the paths to approximate the hidden, radial tree structure, rather than simply directly linking endpoints.
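Concretely, the nesting described above can be sketched as a plain object. All names and coordinates below are invented placeholders for illustration, not values from the actual data.

```javascript
// Rough shape of root: every node carries a data payload plus x / y layout
// coordinates, and internal nodes have a children array that recursively
// repeats the same structure down to the leaves.
const root = {
  data: { id: "pkg" },
  x: 0, y: 0,
  children: [
    {
      data: { id: "pkg.analytics" },
      x: 1.57, y: 50,
      children: [
        { data: { id: "pkg.analytics.cluster" }, x: 1.2, y: 100 }
      ]
    }
  ]
};
```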

  b. What does this line do?

    .attr("d", ([i, o]) => line(i.path(o)))

    Provide one example of an edge in the original visualization (for example, xor <--> or, though this is not a correct answer) where you believe i.path(o) contains more than two elements, and explain your reasoning. You may find it useful to console.log() the result from i.path(o).

The d attribute returns the path coordinates for a single link in the network. i and o represent the input and output nodes at the endpoints of the path. If we drew a line through just [i, o], it would be a straight line from input to output. The i.path(o) call instead extracts an array of all nodes on the shortest path along the tree between the two leaves. By drawing a line through the full path, we encourage the connection between the leaves to curve towards the center of the circle. Moreover, if two pairs of endpoints share many nodes along their shortest tree paths, then they will curve towards similar nodes. This is the source of the bundling effect.

All pairs of leaves have paths with more than two elements, because each path must pass through at least the pair’s common ancestor in the tree. That is why every link in the diagram is curved, even those between adjacent leaves.
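To make this concrete, here is a small, dependency-free sketch of the logic behind i.path(o): walk both nodes up to their lowest common ancestor and join the two half-paths. The tree and node names are invented for illustration.

```javascript
// Minimal reimplementation of the idea behind i.path(o): collect the
// ancestors of each node, find the lowest common ancestor, and join the
// upward half-path from a with the downward half-path to b.
function pathBetween(a, b) {
  const ancestors = n => { const out = []; for (; n; n = n.parent) out.push(n); return out; };
  const upA = ancestors(a);                  // a, ..., root
  const upB = ancestors(b);                  // b, ..., root
  const inA = new Set(upA);
  const lca = upB.find(n => inA.has(n));     // lowest common ancestor
  return upA.slice(0, upA.indexOf(lca) + 1)  // a up to the LCA...
    .concat(upB.slice(0, upB.indexOf(lca)).reverse()); // ...then down to b
}

// Tiny tree: root -> {left, right}; left -> {i}, right -> {o}.
const root = { name: "root", parent: null };
const left = { name: "left", parent: root };
const right = { name: "right", parent: root };
const i = { name: "i", parent: left };
const o = { name: "o", parent: right };
const names = pathBetween(i, o).map(n => n.name);
```

Even for this five-node tree, the path between the two leaves has five elements, which is why the corresponding link would bend through the interior of the layout.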

  c. Imagine that you are working for a biotechnology firm that is interested in visualizing a protein network. You have data on the co-occurrence frequency for all pairs of proteins (high co-occurrence can be interpreted as the proteins lying on a shared regulatory pathway). What, if any, additional information would you need before you could implement a hierarchical edge bundling visualization of the network? Explain your reasoning.

We would need to be able to arrange all the proteins into a hierarchy, so edges alone are not enough. For example, if we knew that the proteins belonged to a nested set of categories, each performing a different type of function, then we could build a tree structure containing them. With this information, we could then build the edge bundling view.

UMAP Image Collection

Scoring

  • a, Discussion (1.5 points): Correct identification of the panning and zooming functions (1.5 points), identification of some, but not all, panning and zooming functions (0.75 points), incorrect identification of the panning and zooming functions (0 points).
  • b, Discussion (1.5 points): Response demonstrates understanding of overview + detail or focus + context principles and considers both advantages and disadvantages (1.5 points), response considers either advantages or disadvantages, but not both, or does not refer to relevant visualization concepts (0.75 points), response is underdeveloped, and it is unclear what the analysis’ main ideas are (0 points).

Question

We will analyze the visualization available at this link, which supports exploration of artworks in the Staatliche Museen zu Berlin. The visualization shows the results of applying UMAP to the high-dimensional image features extracted using a pretrained deep learning model (if you are curious, this notebook gives details). It is implemented using a combination of D3 and a graphics library called PIXI (which we won’t be covering).

Example Solution

  a. This visualization supports panning and zooming. Which lines of code support this?

Panning and zooming are supported by two blocks in this notebook. The first block defines the zooming object,

zoom = d3.zoom()
  .scaleExtent([1, 50])
  .translateExtent([[0,0],[width,height]])
  .clickDistance(2)
  .on("zoom", zoomed);

scaleExtent defines the minimum and maximum zoom levels, and translateExtent gives the range of x and y values for the canvas. The zoomed function is defined in the second block; it transforms the canvas’s appearance every time the user initiates a zoom event (don’t worry about the syntax here, as it uses techniques that we haven’t covered).

function zoomed() {
  const {transform} = d3.event;
  container.scale.set(transform.k)
  container.position.x = transform.x
  container.position.y = transform.y
  renderer.render(container)
}

A final detail: the reset function programmatically adjusts the zoom level, independent of any panning or zooming by the user.

  b. This visualization applies a “fisheye” lens in addition to more standard pan and zoom. Why do you think this was included? Do you think it is effective? Why or why not?

A fisheye lens applies a distortion to implement the focus + context principle. It allows us to zoom into individual pieces without losing the context of the global arrangement of artworks. For effectiveness, you could have answered yes or no, as long as your answer was fully justified. Some arguments for and against this design include,

  • Pro: Moving the mouse and locally updating a view is more streamlined than zooming the entire screen in and out. When we zoom in the standard way, the entire field of vision becomes expanded or contracted, while in a fisheye, we only modify a part of the screen.
  • Pro: The images lying at the center of the fisheye are relatively readable, even when the original icons are difficult to make out properly.
  • Con: The fisheye transformation can be disorienting, drawing more attention to the distortion than the images within it.
  • Con: The shape and size of the fisheye distortion is fixed, which makes it hard to zoom into larger collections of images.
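For intuition about the distortion itself, a one-dimensional fisheye can be sketched with the classic Sarkar-Brown formula. This is an assumption about the general technique, not necessarily the formula this particular notebook uses.

```javascript
// Sketch of a Sarkar-Brown style fisheye in one dimension: positions near the
// focus are pushed apart, positions far away are compressed, and the ends of
// the [0, 1] range stay fixed. "d" is the distortion strength (d = 0 means
// no change at all).
function fisheye(x, focus, d) {
  const left = x < focus;
  const range = left ? focus : 1 - focus;  // distance from focus to the edge
  const t = Math.abs(x - focus) / range;   // normalized distance in [0, 1]
  const td = ((d + 1) * t) / (d * t + 1);  // distorted distance
  return focus + (left ? -td : td) * range;
}

// With the focus at 0.5 and d = 3, a point at 0.6 is pushed out to 0.75,
// while the endpoints 0 and 1 stay put.
const warped = [0, 0.5, 0.6, 1].map(x => fisheye(x, 0.5, 3));
```

Setting d = 0 recovers the identity, and larger d magnifies a wider region around the focus at the expense of the periphery, which is exactly the fixed-size trade-off noted in the last point above.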

It’s worth noting that, though there are a variety of research papers advocating for fisheye lenses in data visualization, the author of this visualization removed the functionality in the final version of the interface, opting instead for guided zooms into and out of prespecified regions.